An Efficient Filtration Method in Biological Sequence Databases
Author
Abstract
Sequence comparison is one of the most important primitive operations in bioinformatics. Roughly speaking, it identifies which parts of two sequences are alike and which differ. As sequence databases grow to millions of base pairs, it becomes impractical to search the whole database with alignment methods based on dynamic programming, which has quadratic time complexity. Filtration methods have therefore been proposed to screen out most unrelated data sequences in a preprocessing stage. However, existing filtration methods either incur false negatives or retain too many candidates. In this paper, we propose a filtration method called the Transformation-based Database Filtration method (TDF), which consists of two phases. In the first phase, we divide the data sequences into blocks, transform each block into a feature vector using the Haar wavelet transform, and build an index over these feature vectors. In the second phase, we search the index and extract the candidate blocks whose distance to the feature vector of the query sequence is less than a predefined threshold. Finally, for each candidate block, we calculate the edit distance between the corresponding data sequence and the query sequence. Experimental results show that our method prunes a large portion of the database and guarantees no false negatives.
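The two phases described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the numeric nucleotide encoding, the choice to keep the first k Haar coefficients, the block length equal to the query length, the plain list standing in for the index, and all function names are hypothetical. In particular, this sketch does not reproduce the paper's no-false-negative guarantee with respect to edit distance.

```python
import math

# Assumed numeric encoding of nucleotides; the abstract does not specify one.
ENCODING = {"A": 0.0, "C": 1.0, "G": 2.0, "T": 3.0}


def haar(values):
    """Full orthonormal Haar transform; len(values) must be a power of two."""
    out = list(values)
    n = len(out)
    while n > 1:
        half = n // 2
        avg = [(out[2 * i] + out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        diff = [(out[2 * i] - out[2 * i + 1]) / math.sqrt(2) for i in range(half)]
        out[:n] = avg + diff  # coarse coefficients first, details after
        n = half
    return out


def features(block, k):
    """First k (coarsest) Haar coefficients of the encoded block."""
    return haar([ENCODING[c] for c in block])[:k]


def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def edit_distance(a, b):
    """Levenshtein distance by row-wise dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def filter_and_verify(sequences, query, k, feature_threshold, edit_threshold):
    block_len = len(query)  # assumption: blocks are as long as the query
    # Phase 1: split each sequence into blocks and index their feature vectors.
    # A plain list stands in for a real multidimensional index.
    index = []
    for seq_id, seq in enumerate(sequences):
        for start in range(0, len(seq) - block_len + 1, block_len):
            index.append((features(seq[start:start + block_len], k), seq_id))
    # Phase 2: keep blocks whose feature distance to the query is under the threshold.
    q = features(query, k)
    candidates = {sid for f, sid in index if euclidean(f, q) < feature_threshold}
    # Verification: edit distance between each candidate's sequence and the query.
    return sorted(sid for sid in candidates
                  if edit_distance(sequences[sid], query) <= edit_threshold)


db = ["ACGTACGT", "TTTTTTTT", "ACGTACGA"]
print(filter_and_verify(db, "ACGTACGT", k=4, feature_threshold=3.0, edit_threshold=1))
# → [0, 2]
```

Because the orthonormal Haar transform preserves Euclidean distance, dropping coefficients can only shrink the distance between two encoded blocks, so this filter never discards a block whose full encoded vector lies within the threshold. That property concerns the assumed numeric encoding only; bounding edit distance requires a feature construction such as the one the paper itself defines.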
Similar resources
Protein Databases
Proteins are sources of many peptides with diverse biological activity. Some of them are considered as valuable components of foods and drug targets with desired and designed biological activity. We are now entering an era rich in biological data in which the field of bioinformatics is poised to exploit this information in increasingly powerful ways. There are currently many databases all over ...
Using Transformation Techniques Towards Efficient Filtration of String Proximity Search of Biological Sequences
The problem of proximity search in biological databases is addressed. We study vector transformations and apply the DFT (Discrete Fourier Transformation) and DWT (Discrete Wavelet Transformation, Haar) dimensionality reduction techniques to DNA sequence proximity search in order to reduce the search time of range queries. Our empirical results on a number of Prokaryote and Eukaryote DNA ...
BFT: A Relational-based Bit Filtration Technique for Efficient Approximate String Joins in Biological Databases
Joining massive tables in relational databases has received substantial attention in the past decade. Numerous filtration and indexing techniques have been proposed to reduce the curse of dimensionality. This paper proposes a novel approach to map the problem of pairwise whole genome comparison into an approximate join operation in the well-established relational database context. We propose a ...
Accelerating Smith-Waterman Alignment for Protein Database Search Using Frequency Distance Filtration Scheme Based on CPU-GPU Collaborative System
The Smith-Waterman (SW) algorithm has been widely utilized for searching biological sequence databases in bioinformatics. Recently, several works have adopted the graphic card with Graphic Processing Units (GPUs) and their associated CUDA model to enhance the performance of SW computations. However, these works mainly focused on the protein database search by using the intertask parallelization...
Efficient Filtration of Sequence Homology Search through Singular Value Decomposition
Similarity search in textual databases and bioinformatics has received substantial attention in the past decade. Numerous filtration and indexing techniques have been proposed to reduce the curse of dimensionality. This paper proposes a novel approach to map the problem of whole-genome sequence homology search into an approximate vector comparison in the well-established multidimensional vector...